Abstract. We present and analyse three online algorithms for learning in discrete Hidden Markov 
Models (HMMs) and compare them with the Baldi-Chauvin Algorithm. Using the Kullback-Leibler 
divergence as a measure of generalisation error we draw learning curves in simplified situations. 
The performance of one of the presented algorithms in learning drifting concepts is analysed 
and compared with that of the Baldi-Chauvin algorithm in the same situations. A brief discussion 
of learning and symmetry breaking based on our results is also presented. 

INTRODUCTION 

Hidden Markov Models (HMMs) [1,2] are extensively studied machine learning models 
for time series with several applications in fields like speech recognition [2], bioinformatics 
[3, 4] and LDPC codes [5]. They consist of a Markov chain of non-observable hidden states 
$q_t \in S$, $t = 1,\dots,T$, $S = \{s_1, s_2, \dots, s_n\}$, with initial probability vector 
$\pi_i = \mathcal{P}(q_1 = s_i)$ and transition matrix 
$A_{ij}(t) = \mathcal{P}(q_{t+1} = s_j | q_t = s_i)$, $i,j = 1,\dots,n$. At discrete times $t$, 
each $q_t$ emits an observed state $y_t \in O$, $O = \{o_1,\dots,o_m\}$, with emission probability 
matrix $B_{i\alpha}(t) = \mathcal{P}(y_t = o_\alpha | q_t = s_i)$, $i = 1,\dots,n$, 
$\alpha = 1,\dots,m$. The emitted states are the actual observations of the time series, 
represented from time $t = 1$ to $t = T$ by the observed sequence 
$y_1^T = \{y_1, y_2, \dots, y_T\}$. The $q_t$'s form the so-called hidden sequence 
$q_1^T = \{q_1, q_2, \dots, q_T\}$. The probability of observing a sequence $y_1^T$ given 
$\omega = (\pi, A, B)$ is 

\[
\mathcal{P}(y_1^T | \omega) = \sum_{q_1^T} \mathcal{P}(q_1)\, \mathcal{P}(y_1 | q_1) \prod_{t=2}^{T} \mathcal{P}(q_t | q_{t-1})\, \mathcal{P}(y_t | q_t). \tag{1}
\]
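The sum over hidden sequences can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: the transition and emission matrices are taken as time-independent, and the function names and the numerical values of the teacher are hypothetical. It evaluates the sequence probability both by direct enumeration of all hidden sequences and by the standard forward recursion, which gives the same number at polynomial cost.

```python
from itertools import product

def likelihood_bruteforce(y, pi, A, B):
    """Sum over all n**T hidden sequences (direct evaluation of the sum)."""
    n, T = len(pi), len(y)
    total = 0.0
    for q in product(range(n), repeat=T):
        p = pi[q[0]] * B[q[0]][y[0]]
        for t in range(1, T):
            p *= A[q[t - 1]][q[t]] * B[q[t]][y[t]]
        total += p
    return total

def likelihood_forward(y, pi, A, B):
    """Same quantity through the forward recursion, cost O(n**2 * T)."""
    n = len(pi)
    alpha = [pi[i] * B[i][y[0]] for i in range(n)]
    for t in range(1, len(y)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][y[t]]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical teacher with n = 2 hidden states and m = 3 observable symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
y = [0, 2]  # observed sequence of length T = 2, symbols indexed from 0

p_brute = likelihood_bruteforce(y, pi, A, B)
p_forward = likelihood_forward(y, pi, A, B)
```

Both functions should agree up to rounding, and the probabilities of all sequences of a fixed length should sum to one.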

In the learning process, the HMM is fed with a series and adapts its parameters to 
produce similar ones. Data feeding can range from offline (all data is fed and parameters 
calculated all at once) to online (data is fed by parts and partial calculations are made). 

We study a scenario with data generated by an HMM of unknown parameters, an extension 
of the student-teacher scenario from neural networks. The performance, as a function of the 
number of observations, is given by how far the student is from the teacher, as measured by a 
suitable criterion. Here we use the naturally arising Kullback-Leibler (KL) divergence which, 
although not accessible in practice since it requires knowledge of the teacher, is a very 
informative extension of the idea of generalisation error. 

We propose three algorithms and compare them with the Baldi-Chauvin Algorithm 
(BC) [6]. The first is the Baum-Welch Online Algorithm (BWO), an adaptation of the offline 
Baum-Welch Reestimation Formulas (BW) [1]. The second, derived from a Bayesian formulation, 
is an approximation named Bayesian Online Algorithm (BOnA), which can be simplified further, 
without noticeable loss of performance, to the third, a Mean Posterior Approximation (MPA). 
BOnA and MPA, inspired by Amari [7] and Opper [8], are essentially mean field methods [9] 
in which a manifold of tractable prior distributions is introduced and each new datum 
leads, through Bayes' theorem, to a non-tractable posterior. The key step is to take as the 
new prior not the posterior itself but the closest distribution (in some sense) in the manifold. 

The paper is organised as follows: first, BWO is introduced and analysed. Next, we 
derive BOnA for HMMs and, from it, MPA. We compare MPA and BC for drifting con- 
cepts. Then, we discuss learning and symmetry breaking and end with our conclusions. 



BAUM-WELCH ONLINE ALGORITHM 



The Baum-Welch Online Algorithm (BWO) is an online adaptation of BW where, in 
each iteration of BW, $y$ becomes $y^p$, the $p$-th observed sequence. Multiplying the BW 
increment by a learning rate $\eta_{BW}$ we get the update equations for $\omega$ 

\[
\omega^{p+1} = \omega^{p} + \eta_{BW}\, \Delta\omega^{p}, \tag{2}
\]

with $\Delta\omega^p$ the BW variations for $y^p$. The complexity of BWO is polynomial in $n$ and $T$. 
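A single BWO step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes time-independent matrices, takes the BW increment to be the difference between the Baum-Welch reestimate for the current sequence and the current parameters, and requires $T \geq 2$ and strictly positive parameters. All function names are hypothetical.

```python
def forward_backward(y, pi, A, B):
    """Forward (alpha) and backward (beta) variables and P(y | omega)."""
    n, T = len(pi), len(y)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][y[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][y[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta, sum(alpha[T - 1])

def baum_welch_estimate(y, pi, A, B):
    """Baum-Welch reestimate from a single sequence (assumes T >= 2)."""
    n, m, T = len(pi), len(B[0]), len(y)
    alpha, beta, py = forward_backward(y, pi, A, B)
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(n)] for t in range(T)]
    new_pi = gamma[0][:]
    new_A, new_B = [], []
    for i in range(n):
        denom_a = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([
            sum(alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] / py
                for t in range(T - 1)) / denom_a
            for j in range(n)])
        denom_b = sum(gamma[t][i] for t in range(T))
        new_B.append([
            sum(gamma[t][i] for t in range(T) if y[t] == a) / denom_b
            for a in range(m)])
    return new_pi, new_A, new_B

def _mix(old, new, eta):
    """old + eta * (new - old); rows stay normalised for 0 <= eta <= 1."""
    return [o + eta * (h - o) for o, h in zip(old, new)]

def bwo_step(y, pi, A, B, eta):
    """One online update: move the parameters a fraction eta towards the BW reestimate."""
    hat_pi, hat_A, hat_B = baum_welch_estimate(y, pi, A, B)
    return (_mix(pi, hat_pi, eta),
            [_mix(r, h, eta) for r, h in zip(A, hat_A)],
            [_mix(r, h, eta) for r, h in zip(B, hat_B)])

# Symmetric initial student with n = 2, m = 3 and one sequence of length T = 2.
pi0 = [0.5, 0.5]
A0 = [[0.5, 0.5], [0.5, 0.5]]
B0 = [[1.0 / 3] * 3 for _ in range(2)]
pi1, A1, B1 = bwo_step([0, 2], pi0, A0, B0, eta=0.1)
```

With this symmetric starting point the transition matrix stays uniform and the two rows of the emission matrix move together, so the update cannot distinguish the hidden states.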

In figure 1, the HMM learns sequences generated by a teacher with $n = 2$, $m = 3$ and 
$T = 2$ for different $\eta_{BW}$. Initial students have matrices with all entries set to the same 
value, which we call a symmetric initial student. We took averages over 500 random 
teachers and distances are given by the KL-divergence between two HMMs $\omega_1$ and $\omega_2$ 

\[
d_{KL}(\omega_1 \| \omega_2) = \sum_{y_1^T} \mathcal{P}(y_1^T | \omega_1) \ln \frac{\mathcal{P}(y_1^T | \omega_1)}{\mathcal{P}(y_1^T | \omega_2)}. \tag{3}
\]

We see that after a certain number of sequences the HMM stops learning, an effect which is 
particular to the symmetric initial student and disappears for a non-symmetric one. 
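For small $m$ and $T$ the KL-divergence between two HMMs can be evaluated exactly by enumerating all $m^T$ observed sequences. The sketch below assumes time-independent matrices, a student with strictly positive probabilities, and the teacher as the first argument of the divergence; the function names and the numerical values are hypothetical.

```python
from itertools import product
from math import log

def sequence_probability(y, pi, A, B):
    """P(y | omega) via the forward recursion."""
    n = len(pi)
    alpha = [pi[i] * B[i][y[0]] for i in range(n)]
    for t in range(1, len(y)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][y[t]]
                 for j in range(n)]
    return sum(alpha)

def kl_divergence(teacher, student, T):
    """KL-divergence over observed sequences of length T, teacher first."""
    m = len(teacher[2][0])
    total = 0.0
    for y in product(range(m), repeat=T):
        p = sequence_probability(y, *teacher)
        q = sequence_probability(y, *student)
        if p > 0.0:  # terms with p = 0 contribute nothing
            total += p * log(p / q)
    return total

# Hypothetical teacher and a symmetric student, both with n = 2 and m = 3.
teacher = ([0.6, 0.4], [[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
student = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[1.0 / 3] * 3, [1.0 / 3] * 3])

d_self = kl_divergence(teacher, teacher, T=2)
d_student = kl_divergence(teacher, student, T=2)
```

The divergence vanishes when student and teacher coincide and is positive otherwise, which is what makes it usable as a learning curve.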

Denoting the variation of the parameters in BC by $\Delta$, in BW by $\hat{\Delta}$, in BWO by 
$\tilde{\Delta}$, and with $\gamma_t(i) = \mathcal{P}(q_t = s_i | y^p, \omega^p)$, we have to first 
order in $\lambda$ 

FIGURE 1. Log-log curves of BWO for three different $\eta_{BW}$ indicated next to the curves. 

For $\eta_{BW} \approx \lambda \eta_{BC}/n$ and small $\lambda$, variations in BC are proportional 
to those in BWO, but with different effective learning rates for each matrix depending on $y^p$. 
Simulations show that the actual values are of the same order as the approximated ones. 

THE BAYESIAN ONLINE ALGORITHM 

The Bayesian Online Algorithm (BOnA) [8] uses Bayesian inference to adjust $\omega$ in the 
HMM using a data set $D_p = \{y^1, \dots, y^p\}$. For each datum, the prior distribution is updated 
by Bayes' theorem. This update takes a prior from a parametric family and transforms it 
into a posterior which in general no longer has the same parametric form. The strategy used 
by BOnA is then to project the posterior back into the initial parametric family. To achieve 
this, we minimise the KL-divergence between the posterior and a distribution in the 
parametric family. This minimisation gives the parameters of the closest parametric 
distribution, by which we approximate the posterior. The parameters of the student 
HMM $\omega$ at each step of the learning process are estimated as the means of 
each projected distribution. 

For a parametric family of the form $P(x) \propto e^{-\sum_i \lambda_i f_i(x)}$, which can be obtained 
by the MaxEnt principle where we constrain the averages over $P(x)$ of arbitrary functions 
$f_i(x)$, minimising the KL-divergence turns out to be equivalent to equating the 
averages $\langle f_i(x) \rangle$ over $P(x)$ to the averages of these functions over the unprojected 
posterior (the posterior distribution just after the Bayesian update for the new datum). 
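The equivalence follows from a short calculation. The lines below sketch the standard exponential-family argument; the normalisation $Z(\lambda)$, the label $P_\lambda$ for a member of the family and the symbol $Q$ for the unprojected posterior are notation introduced here for illustration.

```latex
\begin{align}
P_\lambda(x) &= \frac{1}{Z(\lambda)}\, e^{-\sum_i \lambda_i f_i(x)},
\qquad Z(\lambda) = \int dx\, e^{-\sum_i \lambda_i f_i(x)}, \\
d_{KL}(Q \| P_\lambda) &= \int dx\, Q(x) \ln Q(x)
  + \sum_i \lambda_i \langle f_i \rangle_Q + \ln Z(\lambda), \\
\frac{\partial \ln Z}{\partial \lambda_i} &= -\langle f_i \rangle_{P_\lambda}
\quad\Longrightarrow\quad
\frac{\partial d_{KL}}{\partial \lambda_i}
  = \langle f_i \rangle_Q - \langle f_i \rangle_{P_\lambda} = 0 .
\end{align}
```

The first term of the divergence does not depend on $\lambda$, so the stationarity condition reduces to matching the averages of each $f_i$ under the two distributions.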

For HMMs, the vector $\pi$ and each $i$-th row $A^i$ of $A$ and $B^i$ of $B$ are different discrete 
distributions which we assume independent in order to write the factorized distribution 

\[
\mathcal{P}(\omega | u) = \mathcal{P}(\pi | \rho) \prod_{i=1}^{n} \mathcal{P}(A^i | a^i)\, \mathcal{P}(B^i | b^i), \tag{5}
\]

where $u = (\rho, a, b)$ represents the parameters of the distributions. 

As each factor is a distribution over probabilities, the natural choice is the Dirichlet 
distribution, which for an $N$-dimensional variable $x$ is 

\[
\mathcal{P}(x | u) = \frac{\Gamma(u_0)}{\prod_{i=1}^{N} \Gamma(u_i)} \prod_{i=1}^{N} x_i^{u_i - 1}, \tag{6}
\]

where $u_0 = \sum_i u_i$ and $\Gamma$ is the analytical continuation of the factorial to real numbers. 
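These can be obtained from MaxEnt with $f_i(x) = \ln x_i$ [13], with integrals taken over the 
simplex $\sum_i x_i = 1$: 

\[
\int dx\, \mathcal{P}(x) \ln x_i = \mu_i, \qquad i = 1, \dots, N. \tag{7}
\]

The function to be extremized is 

\[
\mathcal{L} = -\int dx\, \mathcal{P}(x) \ln \mathcal{P}(x)
 - \lambda \left[ \int dx\, \mathcal{P}(x) - 1 \right]
 - \sum_i \lambda_i \left[ \int dx\, \mathcal{P}(x) \ln x_i - \mu_i \right], \tag{8}
\]

and with $\delta\mathcal{L}/\delta\mathcal{P} = 0$ we get the Dirichlet with normalisation 
$e^{\lambda+1}$ and $u_i = 1 - \lambda_i$. 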
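Since the student parameters are estimated as the means of Dirichlet factors, the map from hyperparameters to the student HMM is simple to write down. The following is a minimal sketch using the standard Dirichlet mean and log-density; the function names and the numerical hyperparameters are hypothetical.

```python
from math import lgamma, log

def dirichlet_mean(u):
    """Mean of a Dirichlet with parameters u: u_i / sum(u)."""
    u0 = sum(u)
    return [ui / u0 for ui in u]

def dirichlet_log_density(x, u):
    """Log-density of a Dirichlet at a point x on the simplex (all x_i > 0)."""
    log_norm = lgamma(sum(u)) - sum(lgamma(ui) for ui in u)
    return log_norm + sum((ui - 1.0) * log(xi) for xi, ui in zip(x, u))

def student_from_hyperparameters(rho, a, b):
    """Student HMM (pi, A, B) as the means of the factorized Dirichlet distribution."""
    return (dirichlet_mean(rho),
            [dirichlet_mean(row) for row in a],
            [dirichlet_mean(row) for row in b])

# Hypothetical hyperparameters for n = 2 hidden states and m = 3 symbols.
rho = [1.0, 1.0]
a = [[2.0, 1.0], [1.0, 3.0]]
b = [[1.0, 1.0, 2.0], [4.0, 1.0, 1.0]]
pi, A, B = student_from_hyperparameters(rho, a, b)
```

Each row of the resulting matrices is automatically normalised, and equal hyperparameters within a row correspond to a symmetric student.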

Each factor distribution is separately projected by equating the averages of the 
logarithms in the original posterior $Q$ and in the projected distributions 

\[
\psi(a_{ij}) - \psi\Big( \sum_k a_{ik} \Big) = \langle \ln A_{ij} \rangle_Q \equiv \mu_{ij}, \tag{9}
\]

where $\psi(x) = d\ln\Gamma(x)/dx$ is the digamma function. We call a set of $N$ equations 

\[
\psi(x_i) - \psi\Big( \sum_j x_j \Big) = \mu_i, \tag{10}
\]

with $i = 1, \dots, N$ a digamma system in the variables $x_i$ with coefficients $\mu_i$. 

Let us call $P_p(\omega)$ the projected distribution after observation of $y^p$, and 
$Q_{p+1}(\omega)$ the posterior distribution (not projected yet) after $y^{p+1}$. By Bayes' theorem, 

\[
Q_{p+1}(\omega) \propto P_p(\omega) \sum_{q^{p+1}} \mathcal{P}(y^{p+1}, q^{p+1} | \omega). \tag{11}
\]
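The calculation of the $\mu_i$'s in (9) leads to averages over Dirichlets of the form [10] 

\[
\Big\langle \prod_i x_i^{r_i} \Big\rangle = \frac{\Gamma(u_0)}{\Gamma(u_0 + r_0)} \prod_i \frac{\Gamma(u_i + r_i)}{\Gamma(u_i)}, \qquad r_0 = \sum_i r_i. \tag{12}
\]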
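Averages of monomials over a Dirichlet distribution have a closed form in terms of Gamma functions, which is what keeps the Bayesian update computable. The sketch below evaluates the standard identity with log-Gamma functions for numerical stability; the function name is hypothetical, and the exact form of the averages used in the original derivation may include further factors.

```python
from math import exp, lgamma

def dirichlet_monomial_average(u, r):
    """Average of prod_i x_i**r_i over a Dirichlet with parameters u."""
    u0, r0 = sum(u), sum(r)
    log_avg = lgamma(u0) - lgamma(u0 + r0)
    for ui, ri in zip(u, r):
        log_avg += lgamma(ui + ri) - lgamma(ui)
    return exp(log_avg)

u = [2.0, 3.0, 5.0]
mean_x1 = dirichlet_monomial_average(u, [1, 0, 0])      # u_1 / u_0
second_x1 = dirichlet_monomial_average(u, [2, 0, 0])    # u_1 (u_1 + 1) / (u_0 (u_0 + 1))
cross_x1x2 = dirichlet_monomial_average(u, [1, 1, 0])   # u_1 u_2 / (u_0 (u_0 + 1))
```

The three values reproduce the familiar first and second moments of the Dirichlet, which gives a quick consistency check of the formula.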
To solve (10), we solve for $x_i$, sum over $i$ with $x = \sum_i x_i$ and find numerically, by 
iterating from an arbitrary initial point, the fixed points of the one-dimensional map 

\[
x^{(k+1)} = \sum_{i=1}^{N} \psi^{-1}\big( \mu_i + \psi(x^{(k)}) \big), \tag{13}
\]

where we found a unique solution except for $\mu_i \approx 0$, which is rare in most applications. 

FIGURE 2. Comparison in log-log scale of MPA (dashed line) and BOnA (circles). 
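The fixed-point procedure for a digamma system can be sketched with the standard library alone. The code below is an illustration under stated assumptions: the digamma function is approximated by its recurrence plus an asymptotic series, its inverse is found by bisection (any root finder would do), and the function names, tolerances and iteration limits are hypothetical choices rather than those used in the simulations.

```python
from math import log

def digamma(x):
    """Digamma for x > 0: shift x upwards by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + log(x) - 0.5 / x - series

def inverse_digamma(y, lo=1e-12, hi=1e12):
    """Invert the (increasing) digamma function by bisection on a logarithmic scale."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5
        if digamma(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

def solve_digamma_system(mu, x0=1.0, tol=1e-12, max_iter=10000):
    """Iterate the one-dimensional map for x = sum_i x_i, then recover each x_i."""
    x = x0
    for _ in range(max_iter):
        psi_x = digamma(x)
        new_x = sum(inverse_digamma(m + psi_x) for m in mu)
        converged = abs(new_x - x) < tol * max(1.0, x)
        x = new_x
        if converged:
            break
    psi_x = digamma(x)
    return [inverse_digamma(m + psi_x) for m in mu]

# Consistency check: build the coefficients from known parameters and recover them.
u_true = [2.0, 3.0, 5.0]
mu = [digamma(ui) - digamma(sum(u_true)) for ui in u_true]
u_recovered = solve_digamma_system(mu)
```

Convergence is linear and slows down as the parameters grow, which is consistent with the observation that the digamma evaluations dominate the running time.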

BOnA has a problem common to Bayesian algorithms: the sum over hidden variables 
makes the complexity scale exponentially in $T$. Also, the calculation of several 
digamma functions is very time consuming. In the following, we develop an approximation 
that runs faster, although still with exponential complexity in $T$. This is not a 
problem, since we can keep $T$ constant and the algorithm will scale polynomially in $n$. 



MEAN POSTERIOR APPROXIMATION 

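The Mean Posterior Approximation (MPA) is a simplification of BOnA inspired by 
its results for Gaussians, where the projection matches the first and second moments of the 
posterior and projected distributions. Noting this, instead of minimising $d_{KL}$ we match the 
mean and one of the variances of the posterior and projected distributions as an approximation, 
which gives, with hatted variables for reestimated values [10], 

\[
\begin{aligned}
\hat{\rho}_i &= \langle \pi_i \rangle_Q \, \frac{\langle \pi_1 \rangle_Q - \langle \pi_1^2 \rangle_Q}{\langle \pi_1^2 \rangle_Q - \langle \pi_1 \rangle_Q^2}, \\
\hat{a}_{ij} &= \langle A_{ij} \rangle_Q \, \frac{\langle A_{i1} \rangle_Q - \langle A_{i1}^2 \rangle_Q}{\langle A_{i1}^2 \rangle_Q - \langle A_{i1} \rangle_Q^2}, \\
\hat{b}_{i\alpha} &= \langle B_{i\alpha} \rangle_Q \, \frac{\langle B_{i1} \rangle_Q - \langle B_{i1}^2 \rangle_Q}{\langle B_{i1}^2 \rangle_Q - \langle B_{i1} \rangle_Q^2},
\end{aligned} \tag{14}
\]

with complexity again of order $n^T$, but with heavily reduced real computational time, 
making it better suited to practical applications. 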
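The moment-matching step can be illustrated for a single Dirichlet factor. The sketch below assumes that the matched variance is that of the first component, which is one possible reading of "one of the variances"; the function name and the test values are hypothetical. Given the posterior means and the second moment of the first component, it returns the Dirichlet parameters with the same mean and the same first-component variance.

```python
def match_dirichlet(mean, second_moment_first):
    """Dirichlet parameters matching the given means and the variance of component 1."""
    m1 = mean[0]
    # For a Dirichlet, Var(x_1) = m1 (1 - m1) / (u_0 + 1); solve this for u_0.
    u0 = (m1 - second_moment_first) / (second_moment_first - m1 * m1)
    return [mi * u0 for mi in mean]

# Consistency check with the exact moments of a Dirichlet with parameters (2, 3, 5).
u_true = [2.0, 3.0, 5.0]
u0_true = sum(u_true)
mean = [ui / u0_true for ui in u_true]
second = u_true[0] * (u_true[0] + 1.0) / (u0_true * (u0_true + 1.0))
u_matched = match_dirichlet(mean, second)
```

When the input moments come from an exact Dirichlet the parameters are recovered exactly; for the true posterior, which is a mixture, the result is only the approximation described above. No digamma inversion is needed, which is where the saving in computational time comes from.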

Figure 2 compares MPA and BOnA. The initial difference decreases with time and the two 
curves approach each other relatively fast. We used $n = 2$, $m = 3$ and $T = 2$ and averaged 
over 150 random teachers with symmetric initial students. The computational time was 
340 min for BOnA and 5 s for MPA on a 1 GHz processor. Figure 3a compares MPA to BC and 
figure 3b to BWO. In both cases MPA has better generalisation. We used $n = 2$, $m = 3$, 
$T = 2$, symmetric initial students and averaged over 500 random teachers. 
FIGURE 3. a) Comparison between MPA (dashed) and BC (continuous). Values of $\lambda$ are indicated 
next to the curves. $\eta_{BC} = 0.5$. b) Comparison between MPA (dashed) and BWO (continuous). 
Values of $\eta_{BW}$ are indicated next to the curves. Both scales are log-log. 




FIGURE 4. Drifting concepts. Continuous lines correspond to MPA and dashed lines to BC. a) Abrupt 
changes at intervals of 500 sequences. b) Small random changes at each new sequence. 

LEARNING DRIFTING CONCEPTS 

We tested BC and MPA for changing teachers. In figure 4a, the teacher changes at random after 
every 500 sequences ($\lambda = 0.01$, $\eta_{BC} = 10.0$). In figure 4b, each time a sequence is 
observed, a small random quantity is added to the teacher. Both have $n = 2$, $m = 3$ 
and are averaged over 200 runs. 

Figure 4b shows that BC adapts better, but it is not fully adaptive and we do not know 
how to modify it. MPA, in contrast, derives from Bayesian principles and we can guess the 
problem by analogy with similar Bayesian algorithms [12]: the variances decrease during the 
process, as in the perceptron, where they play the role of learning rates, which explains the 
memory effect that hinders learning after changes. Although not proved yet, we expect the 
same relationship to hold in MPA, which could be used to improve its performance. 



LEARNING AND SYMMETRY BREAKING 



Learning from symmetric initial students requires that the parameters separate from each 
other at some point, which depends on the algorithm and is an important feature in online 
algorithms [11], breaking the symmetry with a sharp decrease in the generalisation error. 

FIGURE 5. KL-divergence and student's parameters for a) BC and b) MPA. 

Instead of taking averages, which smooth abrupt changes, here we draw curves for only 
one teacher, so that the changes remain visible. Flat lines before a symmetry breaking are 
called plateaux and occur when it is difficult to break the symmetry. 

Figure 5a shows BC ($\lambda = 0.01$, $\eta_{BC} = 1.0$) with two abrupt changes: in the beginning 
and after 1000 sequences. $\pi$ and $A$ only break the symmetry at the second point, and $B$ 
at both. Figure 5b shows that in MPA the second change is stronger and the symmetry 
breaking affects both $B$ and $A$. Figure 6 shows BWO with $\eta_{BW} = 0.01$, where only $B$ is 
affected. The more symmetries are broken, the better the generalisation of the algorithm. 

In all simulations we set n = 2, m = 3 and T = 2 with a teacher HMM given by 

\[
\pi = (\cdots), \qquad A = (\cdots), \qquad B = (\cdots). \tag{15}
\]

CONCLUSIONS 

We proposed and analysed three learning algorithms for HMMs: Baum-Welch Online 
(BWO), Bayesian Online Algorithm (BOnA) and Mean Posterior Approximation 
(MPA). We showed the superior performance of MPA for static teachers, whereas the 
Baldi-Chauvin (BC) algorithm is better for drifting concepts, although the Bayesian nature of 
MPA suggests how to fix this. The results seem to be confirmed by initial tests on real data. 

The importance of symmetry breaking in learning processes is presented here in a 
brief discussion where the phenomenon is shown to occur in our models. 
FIGURE 6. KL-divergence and student's parameters for BWO. 